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We present McGenus, an algorithm to predict RNA secondary structures with pseudoknots. The 
method is based on a classification of RNA structures according to their topological genus. Mc- 
Genus can treat sequences of up to 1000 bases and performs an advanced stochastic search of their 
minimum free energy structure allowing for non trivial pseudoknot topologies. Specifically, Mc- 
Genus employs a Monte Carlo algorithm with replica exchange for minimizing a general scoring 
function which includes not only free energy contributions for pair stacking, loop penalties, etc. but 
also a phenomenological penalty for the genus of the pairing graph. The good performance of the 
stochastic search strategy was successfully validated against TT2NE which uses the same free energy 
parametrization and performs exhaustive or partially exhaustive structure search, albeit for much 
shorter sequences (up to 200 bases). Next, the method was applied to other RNA sets, including an 
extensive tmRNA database, yielding results that are competitive with existing algorithms. Finally, 
it is shown that McGenus highlights possible limitations in the free energy scoring function. The 
algorithm is available as a web-server at |http:/ /ipht. cea.fr/rna/mcg enus.php 
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INTRODUCTION 

In the past twenty years, there has been a tremendous 
increase of interest in RNA by the biological community. 
This biopolymer, which was at first merely considered as 
a simple information carrier, was gradually proven to be 
a major actor in the biology of the cell pQ. 

Since the RNA functionality is mostly determined by 
its three-dimensional conformation, the accurate predic- 
tion of RNA folding from the nucleotide sequence is a 
central issue [2]. It is strongly believed that the biolog- 
ical activity of RNA (be it enzymatic or regulatory), is 
implemented through the binding of some unpaired bases 
of the RNA with their ligand. It is thus crucial to have a 
precise and reliable map of all the pairings taking place 
in RNA and to correctly identify loops. The complete 
list of all Watson-Crick and Wobble base pairs in RNA 
is called the secondary structure of RNA. 

In this paper, we stick to the standard assumption that 
there is an effective free energy which governs the forma- 
tion of secondary structures, so that the optimal folding 
of an RNA sequence is found as the minimum free energy 
structure (MFE for short). The problem of finding the 
MFE structure given a certain sequence has been con- 
ceptually solved provided the MFE is planar, i. e. the 
MFE structure contains no pair (k,l) such that 

i<k<j<lork<i<l<j. In that case, poly- 
nomial algorithms which can treat long RNAs assuming 
a mostly linear free energy model have been proposed 
(3H3- Otherwise, the MFE structure is said to contain 
pseudoknots and finding it has been shown to be an NP- 
complctc problem with respect to the sequence length 



In a previous paper [7], we proposed an algorithm, 
TT2NE, which consists in searching for the exact MFE 
structure for a certain form of the energy function, where 
pseudoknots are penalized according to a topological in- 
dex, namely their genus. TT2NE relies on the "max- 
imum weighted independent set" (WIS) formalism. In 
this approach, an RNA structure is viewed as a collec- 
tion of stem-like structures (helices possibly comprising 
bulges of size 1 or internal loops of size lxl), called "he- 
lipoints" [7] , defined in the next section. Given a certain 
sequence, the set of all possible helipoints is enumerated 
and used to build a weighted graph. The graph vertices 
are the helipoints and their weight is given by -1 times 
the helipoint free energy. Two vertices are linked by an 
arc if and only if the corresponding helipoints are not 
compatible in the same secondary structure. Incompati- 
bilities arise, for example, when two helipoints share one 
or more bases as this could imply the formation of base 
triplets, which is forbidden. Finding the MFE structure 
thus amounts to finding the maximum weighted indepen- 
dent set of the graph, i. e. the set of pairwise compatible 
helipoints for which the overall free energy is minimum. 

Both McGenus and TT2NE utilize the same energy 
function, denned in terms of helipoints and genus penalty 
as well as the same initial graph. The difference be- 
tween the two lies in the search algorithm for the MFE. 
While in TT2NE the secondary structure is built by 
adding or removing helipoints in a deterministic or- 
der, in McGenus, they are added or removed one at a 
time according to a stochastic Monte Carlo Metropo- 
lis scheme. As in TT2NE, there is no restriction on 
the pseudoknot topology that McGenus can generate. 
A server implementation of McGenus can be found at 
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|http://ipht.cea.fr/rna/mcgenus.php| 

In the following and in the numerical implementation 
of McGenus, we will restrict ourselves to the energy func- 
tion and genus penalty described in detail in ,7j. While 
in TT2NE, the energy form was dictated by the require- 
ment to allow for a branch and bound procedure, here 
in McGenus we insist that there is no such restriction on 
the form of the energy function. It can for instance in- 
clude loop and pseudoknot entropies. Furthermore, the 
penalty for pseudoknots needs not be proportional to the 
genus as in TT2NE, but may depend also on the topology 
of each individual pseudoknot (see below). Therefore, by 
modifying the energy function, it is possible to improve 
on the results that we will present below. As stated in the 
introduction, the initial graph is generated in the same 
way as in [7J. 

MATERIALS AND METHODS 

In the present framework, the folded structure of a 
given RNA sequence is given by the set of helipoints 
which minimizes the free energy. We recall that a hc- 
lipoint is an ensemble of helices (defined as a stack of 
base pairs possibly comprising bulges of size 1 or inter- 
nal loops of size lxl) that are demarcated by the same 
extremal (initial and terminal) base pairs. Given two ex- 
tremal pairs and (k, I), the set w^J of all helices that 
end with these two pairs can be generated and their in- 
dividual energies calculated according to a given energy 
model. The free energy AF^j of the helipoint is then 
computed as 



exp (- 



exp(-/3e(/i)) 



(1) 



with /3 = {k B T)~ 



where e(h) is the free energy of formation of helix h. 
In our implementation, to speed up the computation of 
this sum, helices of non-negative (i. e. unfavorable) en- 
ergies are neglected, since their Boltzmann weight would 
strongly suppress their contribution. Helipoints are stem- 
like structural building blocks which account for all possi- 
ble internal pairing possibilities that occur between their 
extremal pairs. We shall denote by {hi, h^} the set of 
all helipoints that can possibly arise from the pairings of 
nucleotides in the given sequence (their total number N, 
is clearly sequence dependent). We stress that the set of 
enumerated helipoints comprises all possible helipoints, 
and hence is not restricted to maximal ones. 

Clearly, a given RNA structure S is fully specified by 
a collection of compatible helipoints. It is therefore con- 
venient to identify S with a binary vector, a s , of length 
N and whose i-th component, af takes on the value or 
1 according to whether helipoint hi belongs to S. The 
free energy of S can accordingly be written as: 



JV 



AF(hi)+ng(S) 



(2) 



The first term is the additive contribution of the free 
energy AF of individual helipoints, and is parametrized 
as in [7J . The second term weights the topological com- 
plexity of the structure, measured by its genus g [HI [5]. 
Unlike the first term which is local, the genus, which 
is a non-negative integer, depends globally on all the 
helipoints. The parameter fj, > is used to penalize 
structures with excessively large values of the genus, in 
agreement with the phenomenological observation that 
the genus of most naturally-occurring RNA structures of 
size up to 600 bases, is smaller than 4. Based on previous 
studies [7j , the default value of the genus penalty \i is set 
equal to 1.5 kcal/mol. 

It is implicitly assumed that the free energy of incom- 
patible sets of helipoints is infinite. 



Advanced Monte Carlo search of MFE structures 

The minimization of the free energy ^ is carried out 
by a Monte Carlo (MC) exploration of structure space, 
that is over the set of possible a vectors. Starting from 
a structure S where only one helipoint is present, at 
each Monte Carlo step, one of the helipoints hi is added 
(<7j = — > <Ji = 1) or removed (<7j = 1 — » a, t = 0). The 
helipoint to be modified is picked with a biased proba- 
bility favoring the addition (resp. removal) of helipoints 
with low (resp. high) free energy e. The biasing is in- 
spired by the heat-bath MC algorithm. Specifically, the 
a priori probability to pick helipoint hi to be changed in 
structure S is given by: 



+ (l-of)e" 



-/3AF(hi) 



Y? j =i..N<i + (l-°f)e- /,AFlhl) 



(3) 



where the prime superscript indicates that helipoints in- 
compatible with S are not considered. Changing the 
state of hi defines a trial structure, S', which is accepted 
with probability 



1. 



-P(F S ,-F 3 ) 



(4) 



The above acceptance criterion is a generalization of the 
standard Metropolis rule and ensures that, in the long 
run, the generated structures are sampled with probabil- 
ity given by the canonical weight exp[—/3Fs]. 

The stochastic generation of structures is carried out 
within a Monte Carlo algorithm with replica exchange 
where several simulations are run in parallel at different 
inverse temperatures f3. The values of f3 are chosen so 
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as to cover a range of thermal energies 1/(3, going from 
about one tenth of the smallest helipoint energy up to the 
largest helipoint energy. At regular time intervals, swaps 
are proposed between structures at neighboring temper- 
atures and are accepted with the generalized Metropolis 
criterion described in ref. [TU]. The Markov replicas at 
the lowest temperature progressively populate structures 
of low free-energy, and a record is kept of the lowest en- 
ergy structures which are finally provided as output. 

Finally, we point out that the Monte Carlo optimiza- 
tion can be performed not only within the whole space 
of secondary structures (unconstrained search) but is 
straightforwardly restricted to topologically-constrained 
subspaces. In particular, by introducing ad hoc "infinite" 
energy penalties in eq. [2] the search can be restricted 
to structures whose genus, topology or extent of pseu- 
doknots satisfy some preassigned constraints. The web- 
server interface allows the user to set such thresholds, 
e.g. to account for knowledge based constraints. 

Generalized Topological Penalties 

As we have previously reported [TTJ[TJ], any RNA com- 
plex pseudoknot structure may be built from of a set of 
building blocks, called primitive pseudoknots. A pseudo- 
knot is termed primitive if it is (i) irreducible, i.e. its 
standard diagrammatic representation cannot be discon- 
nected by cutting one backbone line and (ii) contains no 
nested pseudoknot, that is it cannot be disconnected by 
cutting two backbone lines, see Fig.[T] An arbitrary pseu- 
doknotted structure can be decomposed in a collection of 
primitive pseudoknots and its total genus is the sum of 
the genii of its primitive constituents . 



FIG. 1: The only four primitive pseudoknots of genus 1 

Therefore, it makes sense to assign different penal- 
ties to pseudoknots having same genus but with different 
primitive components. For example, all tmRNAs have 
total genus 3 or 4 and contain no primitive pseudoknots 
of genus larger than 1. In the present implementation, 
we propose only two options: i) we forbid primitive pseu- 
doknots of genus larger than 1 (by assigning them an 



infinite penalty) but the overall structure can have any 
total genus or ii) we assign a global penalty proportional 
to the total genus and do not take into account the de- 
composition of the structure into primitive blocks. 

RESULTS AND DISCUSSION 

We have carried out an extensive comparison of Mc- 
Genus predictions against those of other methods. For 
this purpose we used hundreds of RNA sequences from 
various sets, including: the dataset previously used for 
TT2NE [7], an extensive set of tmRNAs [13] and the more 
limited set of pseudoknotted RNA molecules for which 
the structural data is available in the protein databank 
(PDB). Over such diverse datasets, the predictive perfor- 
mance is aptly conveyed by the sensitivity of the method, 
that is the fraction of pairs in the reference (native) struc- 
ture that are correctly predicted by the method. De- 
pending on the context we shall also report on the pos- 
itive predicted value (PPV). The PPV corresponds to 
the fraction of predicted pairs that are found in the na- 
tive structure, and hence measures the incidence of false 
positives in the predicted contacts. We have considered 
this measure for the PDB set, but not for the tmRNA 
set whose entries, often corresponding to putative native 
structures derived from homology, are known to poten- 
tially lack several native contacts, as in the paradigmatic 
case of Aste.yell._TRW-322098_l-426 p3|. A visual rep- 
resentation of this structure can be found in the RNA 
STRAND database Q3] under the reference TMR_00037. 

From an overall point of view, the tests are aimed at 
elucidating two issues that are central to any MFE-based 
method. The first issue, regards the algorithmic effec- 
tiveness of the energy minimization, while the second re- 
gards the viability of the energy parametrization within 
the considered space of secondary structures. The for- 
mer is most clearly ascertained by comparing algorithms 
employing the same energy parametrization. This step is 
crucial for the second aspect too. In fact, the appropri- 
ateness or the limitations of a given energy parametriza- 
tion and/or of the considered secondary structure space, 
can be exposed in a non-ambiguous way only if the min- 
imization algorithm is well-performing. 

Following the above-mentioned logical order, we 
started by comparing the predictions of McGenus against 
TT2NE on a database of 47 short sequences (< 209 
bases) used in |7]. Because McGenus and TT2NE rely 
on the same energy parametrization [IB], the comparison 
provides a stringent test of the effectiveness of the energy- 
minimization procedure. In fact, we recall that TT2NE is 
based on an exhaustive, or nearly exhaustive search in se- 
quence space. Despite the stochastic, non-exhaustive and 
much faster McGenus searches, its performance turned 
out to be optimal. Over the full data set, McGenus re- 
turned exactly the same MFE structures as TT2NE, as 
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well as all the suboptimal structures. 

To extend the assessment of McGenus minimization 
performance for longer chains, that cannot be addressed 
by TT2NE, we considered UNAFold 0], a MFE-based al- 
gorithm restricted to secondary structures without pseu- 
doknots. We used a customized version of UNAFold 
which employs the same energy parametrization as Mc- 
Genus. However, it cannot yet be compared to McGenus 
since it outputs secondary structures in terms of base 
pairs rather than helipoints. To circumvent this diffi- 
culty, we generated all the lowest lying secondary struc- 
tures (within lkCal/mol from the lowest energy struc- 
ture) using the algorithm presented in ref. |15j . To match 
the description of the structure in terms of helipoints, we 
made clusters of secondary structures sharing the same 
extremities of their helical fragments. We then resummed 
them (in terms of their Boltzmann weights) and as a re- 
sult the energy discrepancy between the two approaches 
is negligible. In the sequel, we will refer to this process 
as cUNAFold. 

The comparison was carried out over the complete set 
of 590 sequences of genus 3, 4 or 5 from the tmRNA 
database jT3] with lengths in the 200-500 range. To as- 
sess the efficiency of the minimization algorithm of Mc- 
Genus, we ran it over our sample of 590 sequences, with 
the constraint g max = and compared it with the out- 
put of cUNAFold. The average MFE from McGenus with 
9max = is -105.1 kCal/mol while that of cUNAFold is 
-106.7 kCal/mol. Interestingly enough, out of the 590 
sequences, 191 sequences are predicted to have identical 
secondary structures by both algorithms. This compari- 
son shows the good efficiency of McGenus minimization 
algorithm. 

In the non-zero genus case, for each of the 590 se- 
quences, McGenus returned structures with lower free 
energy than cUNAFold. On the average, the free energy 
of the McGenus predicted structures was -125 kcal/mol. 

These two tests prove the effectiveness of the energy- 
minimization scheme adopted by McGenus and we ac- 
cordingly turned our attention to the overall predictive 
performance of the method (sensitivity). For this pur- 
pose we used again the 590 sequences of genus 3, 4 or 5 
from the tmRNA database [T3] and compared McGenus 
predictions against McQfold [17], HotKnots [18] . Prob- 
Knot 1!) and UNAFold [H] on this set. We did not 
compare McGenus against PKnots [3U] and gfold |21j . as 
the original articles claim that they cannot handle se- 
quences longer than 200 bases. We recall that UNAFold 
predictions are restricted to secondary structures free of 
pseudoknots, while ProbKnot and McQfold can output 
any topology of pseudoknot. The genus of each of Mc- 
Genus prediction was enforced not to exceed the genus 
of the native structures of the dataset. As discussed in 
[7J , the setting of the corresponding parameter g max can 
be decided by the user. In this report, for each test se- 
quence, we chose to set g m ax to the appropriate, native, 



value to illustrate the performance of McGenus when it 
is driven in the appropriate secondary structure search 
space. 

The total number of base pairs to be predicted in the 
set is 56740. The UNAFold, McQfold, ProbKnot, Hot- 
Knots and McGenus arithmetic averages of the sensitiv- 
ity over all sequences are respectively 37%, 42%, 43%, 
39% and 43%, with a respective standard deviation of 
14%, 15%, 14%, 14% and 16%. A closer look at the 
secondary structures output by ProbKnot and HotKnots 
showed that none of them contained any pseudoknot. 
Therefore the performance of McGenus is not inferior 
to that of the few methods that can handle sequences of 
comparable length. Even without resorting to advanced 
comparative tests [23] [H], the consistent sensitivity of 
these 5 algorithms allows to conclude that their perfor- 
mance is very similar. 

The fact that the average sensitivity of the five meth- 
ods is below 50% poses the question of whether it can be 
improved by tweaking the energy parameters or by suit- 
ably further constraining the space of secondary struc- 
tures over which the minimization is performed. We fo- 
cus on the latter aspect as the first has been already 
discussed in [JJ. The space of secondary structures con- 
sidered by prediction schemes based on abstract, graph- 
theoretical representations, include structures that are 
unphysical, i.e. that cannot be realized in a three- 
dimensional space because of chain connectivity con- 
straints. 

The impact of this major difficulty can be lessened 
by excluding from further considerations those structures 
that present physically-unviable or atypical levels of en- 
tanglement. To illustrate this point, we note that, in the 
mentioned dataset of 590 molecules, only H-pseudoknots 
which span less than 70 bases are present. By enforcing 
such knowledge-based constraint on the search space, the 
sensitivity of McGenus is boosted from 43% to 53% with 
a standard deviation of 18%. To assess the statistical sig- 
nificance of this improvement, we performed the Welch 
t-test. We find a t-value of t = 10, which with a total of 
1168 degrees of freedom implies a p— value smaller than 
10 -7 , i.e. the improvement is definitely significant. 

Introducing the constraint in structure space clearly 
results in higher energies for the predicted structures. 
In fact the average free energy was -125 kcal/mol with- 
out the constraint while it is -114 kcal/mol with the re- 
striction of the pseudoknot length. Notwithstanding the 
reduction of the search space due to the pseudoknot- 
lcngth constraint, the structures returned by McGenus 
have an energy that is significantly lower than the ref- 
erence, (putative) native structures, which is about - 
73kcal/mol. The free energy difference appears too large 
to be accounted for by the neglected contribution of 
loop entropy, missing chain-connectivity constraints or 
imperfect parametrization of the potentials, which are 
well established. A more plausible source of discrepancy 
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could the missing contacts in the homology-derived na- 
tive structure of the tmRNA database. 

To check this last point, we have studied the uncon- 
strained version McGenus on a set of 4 sequences from the 
protein databank (PDB) with g max being fixed to the na- 
tive genus. Their PDB ids are: 1Y0Q (length=229, g=l), 
3EOH (length=412, g=l), 2A64 (length=417, g=l) and 
2H0W (length=151, g=2). The structures of these en- 
tries are unambiguously known from X-ray scattering 
data and contain very few long and non-hybridized RNA 
sequences (i.e. not bound to proteins, DNA or other 
molecules). Accordingly, the McGenus performance on 
this set was higher than for the tmRNA set. The sen- 
sitivity for 1Y0Q, 3EOH, 2A64 and 2H0W was equal to 
87%, 39%, 50% and 72%, respectively while the PPV was 
equal to 90%, 38%, 35% and 84%, respectively. Again, 
the structures predicted by McGenus have a lower free en- 
ergy than the native ones. This indicates that, besides ac- 
counting for topological effects, further improvements of 
secondary structure predictions would probably require 
a better parametrization of the free energy. The gen- 
erality and flexibility of the McGenus search algorithm 
ought to allow for incorporating any such modifications 
in a transparent way. 

Finally let us discuss the choice of a maximum genus. 
Ideally, one should perform the computation with a com- 
pletely unconstrained genus. However, there are two diffi- 
culties to this approach. First, since steric constraints are 
only limitedly accounted for by available pseudoknot pre- 
diction algorithms (including McGenus), the predicted 
structures can be sterically impossible and hence associ- 
ated to an excessively high genus. Secondly, the compu- 
tational time required to explore the unrestricted genus 
space could be impractical. To overcome these difficul- 
ties and restrict the search space one can profitably in- 
troduce knowledge-based constraints. In particular, the 
statistical PDB analysis of ref. [11] provides a quantita- 
tive indication for the dependence of the genus on the 
length of naturally-occurring RNA sequences. The data 
can be clearly used to provide a phenomenological upper 
bound to g m ax- Alternatively, a user could explore a few 
different increasing values of g m ax and carry out a su- 
pervised evaluation of the results by taking into account 
(i) the phenomenological constraints and (ii) the possi- 
bility that structures with excessively large genus value 
are returned because of the imperfect treatment of steric 
constraints. 

To illustrate this last point, we ran McGenus on a set 
of 792 5S rRNA sequences of length around 150, with no 
pseudoknot. We set g m ax — 3 which according to the 
study of ref. [IT] (see Fig. 10 therein) is very large. The 
number of sequences predicted with genus (i.e. without 
pseudoknots) is 258, with genus 1 is 500, with genus 2 
is 34 and with genus 3 is 0. Consistently with the re- 
marks made in the context of H-pseudoknots, the results 
indicate that performance of pseudoknot prediction algo- 



rithms could certainly benefit by improving the current 
handling of chain connectivity and excluded volume con- 
straints. 



CPU time 

The CPU time required by McGenus to fold an RNA 
sequence depends on the total number of Monte Carlo 
steps. For a tm-RNA of length 400, the typical num- 
ber of helipoints is 3500. For each sequence, we use 10 
replicas, and overall 3000 x number of helipoints steps 
to achieve these results. The result is typically returned 
in 15 minutes on a parallel quadcore computer (Intel 
Xeon CPU @2.66GHz). The current implementation of 
McGenus on the server is not parallelized. 



CONCLUSION 

In this article, we presented McGenus, an efficient al- 
gorithm for RNA pseudoknot prediction, which proves 
that classifying pseudoknots according to their genus is a 
relevant and successful concept. We showed that on a set 
of RNA structures from the tm-RNA database [13] , Mc- 
Genus allows treatment of sequences of sizes up to 1000 
with a typical CPU time of 15 minutes for a 500 long 
sequence on a quadcore CPU, with a performance that 
is comparable or better than the few methods that can 
treat sequences with comparable length. 

In order to further improve the performance of Mc- 
Genus, we see 3 main directions: I) improvement on the 
computing techniques, in particular on the parallelization 
of the algorithm. II) improvement of the functional form 
and parametrization of the energy model (likely to have 
an impact also on the parametrization of pseudoknot- 
free methods such as UNAFold). Ill) inclusion of steric 
constraints. 
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